1 Introduction

In explaining the behavior of stock returns, common risk factor models are well established in the field of asset pricing. The impetus came from the famous CAPM of Treynor (1961), Sharpe (1964), Lintner (1965), and Mossin (1966), which was later extended to a three-factor, a four-factor, and a five-factor model by Fama and French (1993), Carhart (1997), and Fama and French (2015), respectively. The momentum factor, which extended the three-factor model to the Carhart four-factor model, was first introduced and analysed by Jegadeesh and Titman (1993). Nowadays, a zoo of factors is available (see, e.g., Feng et al. (2020), and the references therein) with a large number of model variants. While the five factors of Fama and French (2015), and the momentum factor, remain among the most important common risk factors, a large literature now exists on how to decide whether or not to include a particular factor among the dozens, if not hundreds, available; see, e.g., Bai and Ng (2002), Stock and Watson (2002), Tsai and Tsay (2010), Bai and Ng (2013), Bai and Liao (2016), and the references therein. The use of factor models has also emerged in the realm of cryptocurrencies. A Crypto-CAPM model and a three-factor model in the spirit of Fama and French (1993) were proposed in Shen et al. (2020). More recently, Liu and Tsyvinski (2021) look at a broader range of cryptocurrency-specific factors and their predictive performance.

Despite an ever-growing and impressive array of approaches to financial factor modeling, most such papers share a common assumption regarding the statistical distribution of the financial asset returns. In particular, the assumption of Gaussianity is nearly ubiquitous. This implies the symmetry of each univariate stock return distribution, a rotational symmetry for the multivariate distribution, and the same thin-tailed behavior for each margin, for both the left and right tail. This rather restrictive list turns out to be not so unrealistic for returns measured at the monthly level, but for daily (and higher-frequency) returns, the evidence for violation of one or more of these constraints is strong and well established; see, for example, Pagan (1996), Chicheportiche and Bouchaud (2012), McNeil et al. (2015), and the references therein. There is substantial evidence that asset returns are not only non-Gaussian, but also non-elliptic; see the evidence and references in McNeil et al. (2015) and Paolella (2018). Non-elliptic distributions can, for example, (i) allow the margins to be asymmetric, and for each margin to have its own asymmetry parameter; and (ii) allow the tail behavior (as a power tail law or semi-heavy tailed law) to differ for each margin.

The multivariate Student’s t is perhaps the most common non-Gaussian distribution deployed after the Gaussian. (This distributional assumption could be nested within our more general framework presented herein.) A recent example of its use in a factor asset pricing model is Kan and Zhou (2017). The multivariate Student’s t is also elliptic, i.e., each margin is symmetric, there is rotational symmetry, and the tail behavior is (now power, but) the same for each margin. While arguably a better empirical choice of distribution than the Gaussian, the Student’s t retains the aforementioned disadvantage of being elliptic and, depending on the application, has the further disadvantage of not possessing a moment generating function; see, e.g., Paolella and Polak (2015b), and the references therein. The imposition of the same tail behavior on each constituent component is unrealistic, and is one of the reasons for the popularity of copula-based models; see Marinelli et al. (2012), Paolella and Polak (2015a, 2018), Näf et al. (2019), and the references therein. Via use of a so-called multi-tail generalized elliptical distribution, Kring et al. (2009) provide further evidence of the presence and necessity of modeling tail heterogeneity.

Of great relevance for our task at hand, some models proposed in the literature that account for the non-Gaussian, non-elliptic nature of the returns data outperform their Gaussian-based counterparts in terms of out-of-sample portfolio performance (in a variety of measures, notably lower risk and higher return). Examples include Paolella and Polak (2015b), in which the joint distribution of asset returns is allowed to follow a multivariate non-elliptic generalized hyperbolic distribution; and Paolella et al. (2019), in which a two-component Markov switching structure, each governed by symmetric generalized hyperbolic distributions (and such that the unconditional distribution is non-elliptic) is used. Other recent examples of using non-Gaussian (but not necessarily non-elliptic) distributions within the context of factor models in asset pricing include Chung et al. (2006), Zhou and Li (2016), and Bao et al. (2018). The common stylized facts of financial asset returns (e.g., heavy tails, volatility clustering, and non-ellipticity) are equally present in cryptocurrencies; see Zhang et al. (2018). Additionally, Hu et al. (2019) find that cryptocurrency portfolios constructed via optimizations that minimize variance and expected shortfall outperform a major stock market index (the S&P 500).

In recent years, machine learning has played an ever-increasing role in finance. In particular, deep learning offers a promising alternative to standard financial models; see the results in Heaton et al. (2017). However, such optimism is tempered by the results in Gu et al. (2020), where one sees that the performance difference between decent linear factor models and non-linear machine learning tools is not substantial. It is important to note that Gu et al. (2020) assume a non-stock-specific model structure, and use stock-specific characteristics instead of common factors as predictor variables. Further, their sample universe contains a wide variety of different companies, from potentially short-lived micro-stocks, up to well-established large stocks. The difference in performance between classical models and sophisticated machine learning methods is even smaller when using stock-specific models, and focusing only on long-lived companies; see De Nard et al. (2020).

In this paper, we introduce the so-called Factor-Heterogeneous-Tails-Generalized-Hyperbolic (Factor-HGH) model. It is a statistically interpretable factor model that extends classical financial factor models such as Fama and French (1993) and Fama and French (2015) by incorporating heavy tails and non-ellipticity in the form of heterogeneous tails. Contrary to classical factor models, our approach builds a joint model of both factors and asset returns. This is done assuming the so-called HGH distribution of Näf et al. (2019) (subsequently discussed). As in the classical factor models, we then obtain the predictive distribution as the marginal distribution of returns. This, however, requires the marginal moment generating function (mgf) based on two random vectors that are jointly modeled as HGH, which we derive in Sect. 3. Thus, we provide a useful technical extension of the HGH class of models, and, crucially, apply this in the context of factors in finance, allowing us to assess if and to what extent it conveys a benefit in terms of out-of-sample performance. We also further generalize the model into a framework that nests the Gaussian and HGH distributions. As a special case, assuming joint Gaussianity, the classical factor model emerges. An interesting benefit of our modeling approach is that it avoids the use of copulas, and thus is able to derive exact distribution theory, obviating the need for simulation. The model parameters associated with our resulting multivariate non-elliptical distribution can be estimated (quickly and efficiently no less, thus rendering our model suitable for large dimensions) by an ECME algorithm. This in turn permits computation of portfolio weights using expected shortfall (ES) as the risk measure. The computation of these weights when using ES requires generic optimization routines, and is thus feasible on desktop PCs for medium-sized dimensions, e.g., on the order of, say, 100.

In contrast to standard machine learning methods as in Gu et al. (2020), which build the portfolios based on the raw excess return predictions, we focus on the prediction of the joint multivariate distribution of the factors and the returns of the assets in the portfolio. Moreover, all of this probabilistic information is used in building the portfolio through the minimization of the expected shortfall (ES). Modeling factors and asset returns jointly is not entirely new; see Pourahmadi (1999) for what appears to be the first contribution in this regard, and Darolles et al. (2018) for an extended version.

The paper proceeds as follows: Section 2 presents the model in full generality, explains the parameter estimation procedure, and connects the new model to the common ones. Section 3 focuses on how the model, despite its seeming complexity, can be used in a straightforward manner to generate portfolio weights. Section 4 introduces a model extension. Section 5 details an empirical investigation, showcasing the benefit of jointly modeling factors and asset returns, and doing so under a non-Gaussian distribution assumption. Section 6 concludes. Appendix 1 contains details on model estimation. Appendix 2 provides proofs of theoretical results deployed in the paper. Appendix 3 contains the list of cryptocurrencies. Finally, Appendix 4 presents robustness tests and further performance statistics.

2 Model

Let \(\textbf{f}_t\) and \(\textbf{r}_t\) denote the value of \(K_f\) factors, and excess simple returns of \(K_r\) assets, at time \(t=1,\ldots , T\), respectively. We consider a joint multivariate model for \(\textbf{Z}_{t} =\left[ \textbf{f}_t^\top , \textbf{r}_t^\top \right] ^\top\) given by

$$\begin{aligned} \textbf{Z}_{t}= \varvec{\mu } + \textbf{L} \varvec{\nu }_t, \end{aligned}$$
(1)

where

$$\begin{aligned} \varvec{\mu } = \begin{bmatrix} \varvec{\mu }^{(f)} \\ \varvec{\mu }^{(r)} \end{bmatrix},\ \textbf{L} = \left[ \begin{array}{c|c} \textbf{L}^{(f)} &{} \textbf{0} \\ \hline \textbf{B} &{} \textbf{L}^{(r)} \\ \end{array} \right] ,\ \varvec{\nu }_t= \begin{bmatrix} \varvec{\nu }^{(f)}_t \\ \varvec{\nu }^{(r)}_t \end{bmatrix}= \textbf{C}^{1/2} \varvec{\varepsilon }_t, \text{ and } \textbf{C} = \left[ \begin{array}{c|c} \textbf{C}^{(f)} &{} \textbf{0} \\ \hline \textbf{0} &{} \textbf{C}^{(r)} \\ \end{array} \right] , \end{aligned}$$

each component of which is now defined. Vectors \(\varvec{\mu }^{(f)}\) and \(\varvec{\mu }^{(r)}\) correspond to the expected factor values and expected returns, respectively. Matrices \(\textbf{L}^{(f)}\), \(\textbf{L}^{(r)}\), and \(\textbf{C}^{(f)}\), \(\textbf{C}^{(r)}\), are \(K_f\times K_f\), and \(K_r\times K_r\), respectively, defined as:

$$\begin{aligned} \textbf{L}^{(\cdot )}= \begin{pmatrix} 1 &{} 0 &{} 0 &{} \ldots &{} 0 \\ q_{21}^{(\cdot )} &{} 1 &{} 0 &{} \ldots &{} 0 \\ q_{31}^{(\cdot )} &{} q_{32}^{(\cdot )} &{} 1 &{} \ldots &{} 0 \\ &{} \ddots &{} \ddots &{} \ddots &{} 0\\ q_{K_{(\cdot )}1}^{(\cdot )} &{} q_{K_{(\cdot )}2}^{(\cdot )} &{} \ldots &{} q_{K_{(\cdot )}K_{(\cdot )}-1}^{(\cdot )}&{} 1 \end{pmatrix},\quad \textbf{C}^{(\cdot )}= \begin{pmatrix} c_1^{(\cdot )}&{} &{} \\ &{} \ddots &{} \\ &{} &{} c^{(\cdot )}_{K_{(\cdot )}} \end{pmatrix}, \end{aligned}$$
(2)

with each \(\textbf{C}^{(\cdot )}\) being diagonal with elements \(c^{(\cdot )}_i>0\), for \(i=1,\ldots , K_{(\cdot )}\). Matrix \(\textbf{B}\) is \(K_r \times K_f\), given by:

$$\begin{aligned} \textbf{B} = \begin{pmatrix} b_{1,1} &{} \ldots &{} b_{1,K_f}\\ \vdots &{} \ddots &{} \vdots \\ b_{K_r,1} &{} \ldots &{} b_{K_r,K_f} \end{pmatrix}. \end{aligned}$$
(3)

Finally,

$$\begin{aligned} \varvec{\varepsilon }_t = \left[ \varvec{\varepsilon }^{(f)\top }_t, \varvec{\varepsilon }^{(r)\top }_t \right] ^\top = \left[ \varepsilon ^{(f)}_{1,t}, \ldots ,\varepsilon ^{(f)}_{K_f,t}, \varepsilon ^{(r)}_{1,t},\ldots ,\varepsilon ^{(r)}_{K_r,t}\right] ^\top \end{aligned}$$

denotes the error term, these being mean zero and independent and identically distributed (iid) over time. In this paper, we choose to restrict \(\textbf{C}\) to be time-invariant, in order to concentrate on the efficacy of the new approach and its features, without introducing a further generalization. However, all models discussed here may incorporate GARCH dynamics into \(\textbf{C}\) to allow for time-varying conditional variance of the error term (see also the Appendix in Näf et al. (2019)).

First consider model (1) with iid Gaussian errors, i.e., \(\varvec{\varepsilon }_t \sim N(\textbf{0}, \textbf{I})\) (referred to as Gaussian–Cholesky). This is in fact a special case of the model in Darolles et al. (2018). Maximum likelihood estimation of the parameters in this Gaussian special case of our model can be easily achieved in two steps, as detailed in Appendix 1. It is worth noting that we can allow for sparse \(\textbf{L}\), by incorporating \(\ell _1\) regularization into the estimation. This generalization corresponds to a maximum a posteriori estimation with Laplace prior (Murphy, 2012).
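As a purely illustrative sketch of the Gaussian–Cholesky case, the following Python snippet assembles the block matrix \(\textbf{L}\) of model (1), simulates \(\textbf{Z}_t\), and compares the sample covariance with the implied \(\textbf{L}\textbf{C}\textbf{L}^\top\). All parameter values, dimensions, and function names (`build_joint_L`, `simulate_gaussian_cholesky`) are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def build_joint_L(L_f, B, L_r):
    """Assemble the block lower-triangular matrix L of model (1)."""
    K_f, K_r = L_f.shape[0], L_r.shape[0]
    top = np.hstack([L_f, np.zeros((K_f, K_r))])
    bottom = np.hstack([B, L_r])
    return np.vstack([top, bottom])

def simulate_gaussian_cholesky(mu, L, c_diag, T, rng):
    """Draw T observations of Z_t = mu + L C^{1/2} eps_t, eps_t ~ N(0, I)."""
    eps = rng.standard_normal((T, len(mu)))
    nu = eps * np.sqrt(c_diag)      # each row is nu_t = C^{1/2} eps_t
    return mu + nu @ L.T            # each row is Z_t

# Hypothetical values: K_f = 2 factors and K_r = 3 assets.
rng = np.random.default_rng(0)
L_f = np.array([[1.0, 0.0], [0.3, 1.0]])
L_r = np.array([[1.0, 0.0, 0.0], [0.2, 1.0, 0.0], [-0.1, 0.4, 1.0]])
B = np.array([[0.9, 0.1], [1.1, -0.2], [0.7, 0.5]])
L = build_joint_L(L_f, B, L_r)
c_diag = np.array([1.0, 0.5, 0.8, 0.6, 0.4])   # diagonal of C
mu = np.zeros(5)

Z = simulate_gaussian_cholesky(mu, L, c_diag, T=200_000, rng=rng)
cov_model = L @ np.diag(c_diag) @ L.T          # implied Cov(Z_t)
cov_sample = np.cov(Z, rowvar=False)
print(np.max(np.abs(cov_sample - cov_model)))  # small for large T
```

The unit diagonal of \(\textbf{L}\) together with the free diagonal of \(\textbf{C}\) amounts to an LDL-type factorization of the covariance matrix, which is why the two matrices agree up to sampling error.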

If we further impose \(\textbf{L}^{(r)} = \textbf{I}\), then the model in (1) simplifies, and corresponds to a “standard” financial factor model with time series regression of the returns on the factors, and normally distributed error term:

$$\begin{aligned} r_{i,t}&= \beta _{i,0} + \varvec{\beta }_i^{\top } \textbf{f}_{t} + \epsilon _{i,t}, \end{aligned}$$
(4)

where \(\varvec{\beta }_i = [\beta _{i,1},\ldots ,\beta _{i,K_f}]^\top\), and \(\epsilon _{i,t} {\sim } N(0, \sigma _i^2)\) iid over t, for \(i=1,\ldots ,K_r\), as derived by Ross (1976, 1977) using the Arbitrage Pricing Theory (APT); and by Chamberlain and Rothschild (1983) in a large economy setting. Indeed, for \(\textbf{L}^{(r)} = \textbf{I}\), the model in (1) reduces to

$$\begin{aligned} \textbf{r}_t&= \varvec{\mu }^{(r)} + \textbf{B} \varvec{\nu }_t^{(f)} + \varvec{\nu }_t^{(r)}\nonumber \\&=\left( \varvec{\mu }^{(r)} - \textbf{B} (\textbf{L}^{(f)})^{-1}\varvec{\mu }^{(f)}\right) + \textbf{B} (\textbf{L}^{(f)})^{-1} \textbf{f}_t + \varvec{\nu }_t^{(r)}. \end{aligned}$$
(5)

Since by assumption \(\varvec{\nu }_t^{(r)} = (\textbf{C}^{(r)})^{1/2}\varvec{\varepsilon }_t^{(r)} {\sim } N(\textbf{0}, \textbf{C}^{(r)})\), iid over t, with \(\textbf{C}^{(r)}\) being a diagonal matrix, it follows that the regressions obtained in (4) and (5) are equivalent, with

$$\begin{aligned} \beta _{i,0} = \left[ \varvec{\mu }^{(r)} - \textbf{B} (\textbf{L}^{(f)})^{-1}\varvec{\mu }^{(f)}\right] _i, \ \ \ \varvec{\beta }_i^{\top } = \left[ \textbf{B} (\textbf{L}^{(f)})^{-1} \right] _{i, \bullet }, \end{aligned}$$
(6)

where \(\left[ \textbf{A} \right] _{i, \bullet }\) denotes the ith row of a matrix \(\textbf{A}\).
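The mapping in (6) can be checked numerically. The sketch below, with made-up parameter values and a hypothetical helper `regression_coefficients`, builds returns once from the joint representation (first line of (5)) and once from the regression form (4) with coefficients from (6), and confirms that the two coincide path by path.

```python
import numpy as np

def regression_coefficients(mu_f, mu_r, L_f, B):
    """Map (mu, L^(f), B) to intercepts beta_0 and loadings beta as in (6)."""
    # beta = B (L^(f))^{-1}; solve L_f^T X = B^T rather than forming the inverse.
    beta = np.linalg.solve(L_f.T, B.T).T
    beta_0 = mu_r - beta @ mu_f
    return beta_0, beta

# Hypothetical values: K_f = 2 factors, K_r = 3 assets, L^(r) = I.
rng = np.random.default_rng(1)
T = 1_000
mu_f = np.array([0.02, 0.01])
mu_r = np.array([0.05, 0.03, 0.04])
L_f = np.array([[1.0, 0.0], [0.3, 1.0]])
B = np.array([[0.9, 0.1], [1.1, -0.2], [0.7, 0.5]])
c_f = np.array([1.0, 0.5])
c_r = np.array([0.8, 0.6, 0.4])

nu_f = rng.standard_normal((T, 2)) * np.sqrt(c_f)
nu_r = rng.standard_normal((T, 3)) * np.sqrt(c_r)

f = mu_f + nu_f @ L_f.T                      # factor block of model (1)
r_joint = mu_r + nu_f @ B.T + nu_r           # return block of model (1)

beta_0, beta = regression_coefficients(mu_f, mu_r, L_f, B)
r_regression = beta_0 + f @ beta.T + nu_r    # regression form (4)

print(np.max(np.abs(r_joint - r_regression)))  # numerically zero
```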

One may also compare the resulting mean and covariance of \(\textbf{r}_t\) of the two approaches: combining (1) with (6), we find that

$$\begin{aligned} {\mathbb E}[\textbf{r}_t]=\varvec{\mu }^{(r)}&= \begin{pmatrix} \beta _{1,0}\\ \vdots \\ \beta _{K_{r},0} \end{pmatrix} + \begin{pmatrix} \varvec{\beta }_{1}^{\top }\\ \vdots \\ \varvec{\beta }_{K_{r}}^{\top } \end{pmatrix} {\mathbb E}[\textbf{f}_t] \end{aligned}$$
(7)

and

$$\begin{aligned} \text{ Cov }(\textbf{r}_t)&= \begin{pmatrix} \varvec{\beta }_{1}^{\top }\\ \vdots \\ \varvec{\beta }_{K_{r}}^{\top } \end{pmatrix}\text{ Cov }(\textbf{f}_t) \begin{pmatrix} \varvec{\beta }_{1}&\cdots&\varvec{\beta }_{K_{r}} \end{pmatrix} + \textbf{C}^{(r)}, \end{aligned}$$
(8)

which are two moments obtained from the regression in (4). Formula (8) is the basis for many shrinkage approaches to estimate the covariance matrix of \(\textbf{r}_t\) (Ledoit and Wolf 2020, Section 5), and, for \(K_f < K_r\), the factors provide a dimension reduction as in Fan et al. (2008).

In the general (but still Gaussian) case with \(\textbf{L}^{(r)} \ne \textbf{I}\), (5) becomes

$$\begin{aligned} \textbf{r}_t&=\left( \varvec{\mu }^{(r)} - \textbf{B} (\textbf{L}^{(f)})^{-1}\varvec{\mu }^{(f)}\right) + \textbf{B} (\textbf{L}^{(f)})^{-1} \textbf{f}_t + \textbf{L}^{(r)} \varvec{\nu }_t^{(r)}, \end{aligned}$$
(9)

which is a linear regression with an error vector \(\textbf{L}^{(r)} \varvec{\nu }_t^{(r)}\) that is Gaussian with mean vector \(\textbf{0}\) and non-diagonal covariance matrix \(\textbf{L}^{(r)} \textbf{C}^{(r)} (\textbf{L}^{(r)})^{\top }\). Thus, rather than being estimated independently, the regression equations for the individual returns are interconnected through the correlation structure of their error terms, as in the seemingly unrelated regression equations (SURE) of Zellner (1962).

We now turn to the non-Gaussian case. Financial returns, and financial factors, measured at the daily or higher frequency, exhibit leptokurtosis and mild asymmetry. (This is illustrated below in Table 1.) In order to accommodate these stylized facts, we can impose in (1) a (semi-)heavy-tailed distribution such as the (symmetric) generalized hyperbolic (GHyp) distribution for the error terms. Its use was arguably popularized in McNeil et al. (2015), and continues to be used; see, e.g., Bianchi et al. (2020). That is, we take \(\varepsilon _{i,t} {\sim } {{\,\textrm{GHyp}\,}}(\lambda _i, \alpha _i, \delta _i, \mu _i)\), with the probability density function

$$\begin{aligned} f_{{{\,\textrm{GHyp}\,}}}(y; \lambda , \alpha , \delta , \mu )= \frac{{{\,\textrm{k}\,}}_{\lambda -\frac{1}{2}}\left( (y-\mu )^2 + \delta ^2,\, \alpha ^2\right) }{\sqrt{2\pi }{{\,\textrm{k}\,}}_{\lambda }\left( \delta ^2 ,\, \alpha ^2\right) }, \end{aligned}$$
(10)

with \(\lambda , \mu \in \mathbb {R}\) being the shape and location parameters, \(\alpha > 0\) the tail parameter, and \(\delta > 0\) controlling the shape of the p.d.f. near its mode. Further,

$$\begin{aligned} k_{\lambda }(\chi , \psi ) = 2(\chi /\psi )^{\lambda /2}K_{\lambda }(\sqrt{\chi \psi }), \end{aligned}$$
(11)

and \(K_v(x)\) is the modified Bessel function of the third kind with index v, given for all \(x>0\) by

$$\begin{aligned} K_{v}(x) = \frac{1}{2}\int _0^{\infty } t^{v-1}e^{-x(t+t^{-1})/2} \text {d}t. \end{aligned}$$
(12)
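A minimal numerical sketch of (10)–(12) in Python follows, using `scipy.special.kv` for the Bessel function \(K_v\). The function names `k_fun` and `ghyp_pdf`, and the parameter values, are illustrative assumptions; the check simply confirms that the density integrates to one.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import kv

def k_fun(lam, chi, psi):
    """k_lambda(chi, psi) as defined in (11)."""
    return 2.0 * (chi / psi) ** (lam / 2.0) * kv(lam, np.sqrt(chi * psi))

def ghyp_pdf(y, lam, alpha, delta, mu=0.0):
    """Symmetric GHyp density as written in (10)."""
    num = k_fun(lam - 0.5, (y - mu) ** 2 + delta ** 2, alpha ** 2)
    den = np.sqrt(2.0 * np.pi) * k_fun(lam, delta ** 2, alpha ** 2)
    return num / den

# Hypothetical parameter values (lam = -1/2 is the normal inverse Gaussian case).
lam, alpha, delta, mu = -0.5, 1.5, 1.0, 0.0
total_mass, _ = quad(ghyp_pdf, -np.inf, np.inf, args=(lam, alpha, delta, mu))
print(total_mass)  # should be close to 1
```

For extreme arguments the Bessel function underflows; a production implementation would more likely work with `scipy.special.kve` on the log scale, which is omitted here for brevity.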

For a more detailed discussion of the GHyp distribution, we refer to Paolella (2007, Chapter 9). It is noteworthy that each marginal distribution is endowed with its own set of shape parameters, allowing for tail heterogeneity, thus differentiating it from the use of the multivariate generalized hyperbolic distribution, as used in Paolella and Polak (2015b). Model (1) with GHyp innovations results in the HGH model of Näf et al. (2019) for the joint distribution of the returns and financial factors. This distributional construction was proposed independently by Schmidt et al. (2006) and Näf et al. (2019), the latter authors not having known about the former. The set of parameters to estimate consists of

$$\begin{aligned} \varvec{\mu }=\left[ \varvec{\varvec{\mu }}^{(f)\top }, \varvec{\varvec{\mu }}^{(r)\top } \right] ^\top ,\ \textbf{L} = \left[ \begin{array}{c|c} \textbf{L}^{(f)} &{} \textbf{0} \\ \hline \textbf{B} &{} \textbf{L}^{(r)} \\ \end{array} \right] ,\ \varvec{\Phi }=\begin{pmatrix} \varvec{\alpha }^{\top }&\varvec{\delta }^{\top }&\varvec{\lambda }^{\top }&\text{ diag }(\textbf{C})^{\top } \end{pmatrix}^{\top }, \end{aligned}$$
(13)

where \(\varvec{\Phi }\) gathers the parameters corresponding to the distribution of \(\varvec{\nu }_t\). Näf et al. (2019) derive an ECME algorithm to estimate the parameters iteratively. It exploits two features: the normal mixture representation of the GHyp distribution, with the generalized inverse Gaussian (GIG) distribution as the mixing distribution, and the lower-triangular form of matrix \(\textbf{L}\). The former yields closed-form expressions for the conditional moments computed in the E-step of the algorithm; the latter allows for sequential updating of the parameters in \(\textbf{L}\). In Appendix 1, we adapt the algorithm to estimate the parameters in (13) that characterize the joint multivariate distribution of the financial factors and the returns of the assets in the portfolio. The returns and factors obey relation (9), but now with errors in the regression that are both correlated and heavy-tailed.
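The normal mixture representation mentioned above can be illustrated by simulation. The sketch below assumes that the mixing variable has density proportional to \(g^{\lambda -1}\exp \{-(\delta ^2/g + \alpha ^2 g)/2\}\), which is the GIG form consistent with (10) and (11); this parametrization, the helper names, and the parameter values are assumptions of the example. It uses `scipy.stats.geninvgauss`, whose two-parameter form is rescaled accordingly, and compares the sample variance with \({\mathbb E}[G]\).

```python
import numpy as np
from scipy.special import kv
from scipy.stats import geninvgauss

def sample_gig(lam, alpha, delta, size, rng):
    """Draw G with density prop. to g^(lam-1) exp(-(delta^2/g + alpha^2 g)/2)."""
    # SciPy's geninvgauss(p, b) has density prop. to x^(p-1) exp(-b (x + 1/x) / 2);
    # multiplying by delta/alpha maps it to the (delta^2, alpha^2) form above.
    y = geninvgauss.rvs(lam, delta * alpha, size=size, random_state=rng)
    return (delta / alpha) * y

def sample_ghyp(lam, alpha, delta, size, rng, mu=0.0):
    """Symmetric GHyp draw as a normal variance mixture: mu + sqrt(G) * N(0, 1)."""
    g = sample_gig(lam, alpha, delta, size, rng)
    return mu + np.sqrt(g) * rng.standard_normal(size)

def ghyp_variance(lam, alpha, delta):
    """Var = E[G] = (delta/alpha) K_{lam+1}(delta alpha) / K_lam(delta alpha)."""
    x = delta * alpha
    return (delta / alpha) * kv(lam + 1.0, x) / kv(lam, x)

# Hypothetical parameter values.
rng = np.random.default_rng(5)
lam, alpha, delta = -0.5, 1.5, 1.0
sample = sample_ghyp(lam, alpha, delta, size=100_000, rng=rng)
print(sample.var(), ghyp_variance(lam, alpha, delta))
```

Conditionally on \(G\), the draw is Gaussian, which is the property that makes the E-step moments available in closed form.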

3 Portfolio optimization

For a given set of \(K_r\) assets, we would like to produce a portfolio vector \(\textbf{w}\), using the additional information of \(K_f\) factors. In addition, we impose a short-selling constraint, this being, firstly, quite common in practice; see, e.g., Almazan et al. (2004), who report that 70% of mutual funds explicitly state that short-selling is not permitted; and, secondly, its imposition can be interpreted as a useful form of shrinkage that leads to better performance; see, e.g., Jagannathan and Ma (2003), and DeMiguel et al. (2009). We find optimal portfolio weights by minimizing, for a given level \(\alpha \in (0,1)\), the expected shortfall of the portfolio returns

$$\begin{aligned} \min _{\textbf{w} \in \mathfrak {W}_{\theta }} \text{ ES}_{\alpha } (\textbf{w}^\top \textbf{r}_{t+1}), \end{aligned}$$
(14)

conditional on the information up to time t. \(\mathfrak {W}_{\theta }\) characterizes the set of feasible portfolios for a long-only strategy

$$\begin{aligned} \mathfrak {W}_{\theta }=\big \{\textbf{w}\in \mathbb {R}^{K_r}:\ \textbf{w}^\top \varvec{\mu }^{(r)} \ge \theta , \quad \sum _{k=1}^{K_r} w_k =1,\ w_k\ge 0, \text{ for } k=1,\ldots ,K_r \big \}, \end{aligned}$$
(15)

which imposes a lower bound on the expected portfolio return, a fully invested portfolio, and the short-selling constraint. In the case of Gaussian errors from Sect. 2, as shown in Embrechts et al. (2002), (14) is equivalent to the minimum-variance problem

$$\begin{aligned} \min _{\textbf{w} \in \mathfrak {W}_{\theta }} \text{ Var } (\textbf{w}^\top \textbf{r}_{t+1})=\min _{\textbf{w} \in \mathfrak {W}_{\theta }} \textbf{w}^\top \text{ Cov } (\textbf{r}_{t+1}) \textbf{w}, \end{aligned}$$
(16)

with, according to our model,

$$\begin{aligned} \text{ Cov } (\textbf{r}_{t+1}) = \textbf{B} \textbf{C}^{(f)} \textbf{B}^{\top } + \textbf{L}^{(r)}\textbf{C}^{(r)} (\textbf{L}^{(r)})^{\top }. \end{aligned}$$
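For the Gaussian case, problem (16) over the feasible set (15) is a small quadratic program. The following sketch solves it with SciPy's general-purpose SLSQP routine; the solver choice, the helper names, and all numbers are assumptions for illustration only and are not meant to reflect the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def return_covariance(B, c_f, L_r, c_r):
    """Cov(r_{t+1}) = B C^(f) B^T + L^(r) C^(r) L^(r)^T under Gaussian errors."""
    return B @ np.diag(c_f) @ B.T + L_r @ np.diag(c_r) @ L_r.T

def min_variance_long_only(cov, mu_r, theta):
    """Minimise w' cov w s.t. w' mu_r >= theta, sum(w) = 1, w >= 0."""
    K_r = len(mu_r)
    w0 = np.full(K_r, 1.0 / K_r)             # equally weighted start
    constraints = [
        {"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
        {"type": "ineq", "fun": lambda w: w @ mu_r - theta},
    ]
    result = minimize(lambda w: w @ cov @ w, w0, jac=lambda w: 2.0 * cov @ w,
                      bounds=[(0.0, 1.0)] * K_r, constraints=constraints,
                      method="SLSQP")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x

# Hypothetical parameter values with K_f = 2 and K_r = 3.
B = np.array([[0.9, 0.1], [1.1, -0.2], [0.7, 0.5]])
L_r = np.array([[1.0, 0.0, 0.0], [0.2, 1.0, 0.0], [-0.1, 0.4, 1.0]])
c_f = np.array([1.0, 0.5])
c_r = np.array([0.8, 0.6, 0.4])
mu_r = np.array([0.05, 0.03, 0.04])
theta = 0.035

cov_r = return_covariance(B, c_f, L_r, c_r)
w_star = min_variance_long_only(cov_r, mu_r, theta)
w_equal = np.full(3, 1.0 / 3.0)
print(w_star, w_star @ cov_r @ w_star, w_equal @ cov_r @ w_equal)
```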

In the case of a non-elliptical HGH model, as shown in Näf et al. (2019), direct calculation of the expected shortfall in (14) requires numerical integration over a density that has itself an infinite series representation. Therefore, we use the Rockafellar and Uryasev (2000) result, together with the saddle point approximation from Broda and Paolella (2009). In particular, let \(S_t= \textbf{w}^\top \textbf{r}_{t}\) denote the portfolio return at time t and y a generic variable. Then we use

$$\begin{aligned}&F_{\alpha }(y;\textbf{w})=- \textbf{w}^\top \varvec{\mu } + y\frac{2 \alpha - 1}{2 \alpha } + \frac{1}{\pi \alpha } \int _{0}^{+\infty } \mathop{\textrm{Im}} \left( \frac{{\mathbb M}_{S_t}(iz)}{e^{-iz \textbf{w}^\top \varvec{\mu }}} ({\mathbb K}_{S_t}'(iz)-\textbf{w}^\top \varvec{\mu } + y) e^{-izy} \right) \frac{\text {d}z}{z}, \end{aligned}$$
(17)

where \({\mathbb M}_{S}\) and \({\mathbb K}_{S}\) denote, respectively, the moment generating function (mgf) and the cumulant generating function (cgf) of random variable S and \({\mathbb K}_{S_t}'\) the first derivative of \({\mathbb K}_{S_t}\). Solving the minimization problem

$$\begin{aligned} (y^*,\textbf{w}^*)=\mathop {\mathrm {arg\,min}}\limits _{(y,\textbf{w}) \in {\mathbb R}\times \mathfrak {W}_{\theta }} F_{\alpha }(y;\textbf{w}), \end{aligned}$$
(18)

yields the optimal portfolio \(\textbf{w}^*\). Moreover, the ES of the optimal portfolio is given as \(\mathop{\textrm{ES}}_{\alpha }(\textbf{w}^{*\top } \textbf{r}_{t+1})=\min _{(y,\textbf{w})} F_{\alpha }(y;\textbf{w})\). Evaluation of (17) requires (only) computable expressions for \({\mathbb M}_{S_t}\) and \({\mathbb K}_{S_t}\) of the portfolio distribution. The distribution of \(\textbf{r}_t\) is no longer HGH, but we can still derive the mgf of \(\textbf{r}_t\), \({\mathbb M}_{\textbf{r}_t}\), which is necessary for portfolio optimization. Define \(\textbf{0}_{k_1 \times k_2}\) as the matrix of zeros of dimension \(k_1 \times k_2\), \(\textbf{I}_{k_1 \times k_1}\) as the identity matrix of dimension \(k_1 \times k_1\), and

$$\begin{aligned} \textbf{A}_1&:= \begin{pmatrix} \textbf{I}_{K_{f} \times K_{f}}&\textbf{0}_{K_{f} \times K_{r}} \end{pmatrix}, \ \ \textbf{A}_2 := \begin{pmatrix} \textbf{0}_{K_{r} \times K_{f}}&\textbf{I}_{K_{r} \times K_{r}} \end{pmatrix}, \end{aligned}$$
(19)

such that \(\textbf{f}_t=\textbf{A}_1 \textbf{Z}_t\), \(\textbf{r}_t=\textbf{A}_2 \textbf{Z}_t\), \(\textbf{L}^{(f)}= \textbf{A}_1 \textbf{L} \textbf{A}_1^{\top }\), \(\textbf{L}^{(r)}= \textbf{A}_2 \textbf{L} \textbf{A}_2^{\top }\) and \(\textbf{B}= \textbf{A}_2 \textbf{L} \textbf{A}_1^{\top }\).
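A short numerical check of how the selection matrices in (19) pick out the blocks of \(\textbf{L}\) is given below, with arbitrary made-up dimensions and entries. In this sketch the \(K_r \times K_f\) loading block in the lower-left corner is recovered as \(\textbf{A}_2 \textbf{L} \textbf{A}_1^{\top }\), whereas \(\textbf{A}_1 \textbf{L} \textbf{A}_2^{\top }\) returns the zero block in the upper-right corner.

```python
import numpy as np

K_f, K_r = 2, 3          # hypothetical dimensions
K = K_f + K_r

A1 = np.hstack([np.eye(K_f), np.zeros((K_f, K_r))])   # selects the factor block
A2 = np.hstack([np.zeros((K_r, K_f)), np.eye(K_r)])   # selects the return block

# Arbitrary unit lower-triangular L; its upper-right K_f x K_r block is zero.
rng = np.random.default_rng(2)
L = np.tril(rng.normal(size=(K, K)), k=-1) + np.eye(K)

L_f = A1 @ L @ A1.T            # top-left block, K_f x K_f
L_r = A2 @ L @ A2.T            # bottom-right block, K_r x K_r
B = A2 @ L @ A1.T              # bottom-left block, K_r x K_f
upper_right = A1 @ L @ A2.T    # top-right block, identically zero

print(B.shape, np.abs(upper_right).max())
```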

Lemma 1

Let \(\textbf{X}, \textbf{Y}\) be random vectors in \({\mathbb R}^{K_{f}}\) and \({\mathbb R}^{K_{r}}\) respectively, such that \(\textbf{Z}=[\textbf{X}^\top , \textbf{Y}^\top ]^\top \sim \mathop{\textrm{HGH}}(\varvec{\mu }, \varvec{\Phi }, \textbf{L})\), with \(K=K_{f}+K_{r}\). Then the mgf of \(\textbf{Y}\) is given as

$$\begin{aligned} {\mathbb M}_{\textbf{Y}}(\textbf{u})=\exp (\textbf{u}^\top \varvec{\mu }^{(\textbf{Y})}) \prod _{i=1}^{K}\left( \frac{\alpha _i^2}{\alpha _i^2 - \tilde{u}_i^2} \right) ^{\lambda _i/2} \frac{K_{\lambda _i}(\delta _i \sqrt{ \alpha _i^2 - \tilde{u}_i^2})}{K_{\lambda _i}(\delta _i \alpha _i)}, \end{aligned}$$
(20)

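with, for \(i=1,\ldots ,K\), \(k=K_f\), \(c_{ii}\) the ith diagonal element of \(\textbf{C}\), \(q_{\ell i}\) the \((\ell , i)\) element of \(\textbf{L}\), and the entries of \(\textbf{u}\) indexed by \(\ell =k+1,\ldots ,K\),

$$\begin{aligned} \tilde{u}_i={\left\{ \begin{array}{ll}c_{ii}^{1/2} \sum _{\ell =k+1}^K u_{\ell } q_{\ell i}, &{} \text { if \;} i < k+1, \\ c_{ii}^{1/2} \left( u_i+ \sum _{\ell =i+1}^K u_{\ell } q_{\ell i} \right) , &{} \text { if \;} i\ge k+1. \end{array}\right. } \end{aligned}$$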

The changes induced by considering factors very much resemble the Gaussian case. For instance, and similar to the comparable result stated in Sect. 2, it holds that

$$\begin{aligned} \text{ Cov }(\textbf{r}_t)= \textbf{B} \text{ Cov }(\varvec{\nu }^{(f)}_t) \textbf{B}^{\top } + \textbf{L}^{(r)} \textbf{C}^{(r)} (\textbf{L}^{(r)})^{\top }. \end{aligned}$$
(21)

In general, the mgf in (20) for \(\textbf{Y}=\textbf{r}_t\) depends on \(\textbf{f}_t\) through \(q_{\ell i }\), for \(\ell \ge k+1\) and \(i < k+1\), which corresponds to the effects of the factors on the returns. The proof of Lemma 1 can be found in Appendix 2.
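To make Lemma 1 concrete, the sketch below evaluates the marginal mgf (20) in Python. It reads the case distinction for \(\tilde{u}_i\) with \(k=K_f\) and with the summand \(u_{\ell } q_{\ell i}\), and it notes (as an observation of this sketch, not a statement from the text) that the same quantities can be written compactly as \(\textbf{C}^{1/2}\textbf{L}^\top \textbf{v}\) with \(\textbf{v}=[\textbf{0}^\top , \textbf{u}^\top ]^\top\). Function names and parameter values are hypothetical.

```python
import numpy as np
from scipy.special import kv

def u_tilde_loop(u, L, c_diag):
    """Element-wise u-tilde following the case distinction, with k = K_f."""
    K, K_r = L.shape[0], len(u)
    k = K - K_r
    out = np.zeros(K)
    for i in range(K):                        # 0-based i is i + 1 in the text
        if i < k:
            s = sum(u[l - k] * L[l, i] for l in range(k, K))
        else:
            s = u[i - k] + sum(u[l - k] * L[l, i] for l in range(i + 1, K))
        out[i] = np.sqrt(c_diag[i]) * s
    return out

def u_tilde_matrix(u, L, c_diag):
    """Same quantity as C^{1/2} L^T v with v = (0, u)."""
    v = np.concatenate([np.zeros(L.shape[0] - len(u)), u])
    return np.sqrt(c_diag) * (L.T @ v)

def mgf_returns(u, mu_r, L, c_diag, lam, alpha, delta):
    """Marginal mgf (20); requires |u_tilde_i| < alpha_i for every i."""
    ut = u_tilde_matrix(u, L, c_diag)
    if np.any(np.abs(ut) >= alpha):
        raise ValueError("mgf undefined: need |u_tilde_i| < alpha_i")
    root = np.sqrt(alpha ** 2 - ut ** 2)
    terms = (alpha ** 2 / root ** 2) ** (lam / 2.0) \
        * kv(lam, delta * root) / kv(lam, delta * alpha)
    return np.exp(u @ mu_r) * np.prod(terms)

# Hypothetical values with K_f = 2 and K_r = 3.
L = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.3, 1.0, 0.0, 0.0, 0.0],
              [0.9, 0.1, 1.0, 0.0, 0.0],
              [1.1, -0.2, 0.2, 1.0, 0.0],
              [0.7, 0.5, -0.1, 0.4, 1.0]])
c_diag = np.array([1.0, 0.5, 0.8, 0.6, 0.4])
lam = np.full(5, -0.5)
alpha = np.array([1.5, 2.0, 1.8, 2.2, 1.6])
delta = np.ones(5)
mu_r = np.zeros(3)
u = np.array([0.1, -0.05, 0.2])

mgf_value = mgf_returns(u, mu_r, L, c_diag, lam, alpha, delta)
print(mgf_value)
```

The first \(K_f\) entries of \(\tilde{\textbf{u}}\) are the channel through which the factor innovations, and hence their tail parameters, enter the distribution of the returns.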

4 Hybrid models

The distributions for \(\varvec{\varepsilon }_t\) in (1) considered so far can be seen as two extremes on a spectrum. To see this we rewrite (1) as

$$\begin{aligned} \textbf{Z}_{t}= \varvec{\mu } + \textbf{L} \textbf{C}^{1/2} \textbf{D}_t^{1/2} \varvec{\epsilon }_t, \end{aligned}$$
(22)

with \(\varvec{\epsilon }_t {\sim } N(\textbf{0}, \textbf{I})\) and

$$\begin{aligned} \textbf{D}_t= \text{ diag }(G_{1,t}, \ldots , G_{K,t}). \end{aligned}$$
(23)

The relevant difference between those distributions in our context lies in the choice of the mixture variables G. For the Gaussian distribution, \(G_{1,t}= \ldots = G_{K,t}=1\), while for the HGH, \(G_{1,t}, \ldots , G_{K,t}\) are independently GIG. On the other hand, one might also imagine that \(G_{1,t}= \ldots = G_{K,t}=G\), where G is again GIG. Generalizing this, we might divide the marginals into groups of equal tail behavior, such that the Gaussian case would correspond to one group and the HGH case to K groups.

In general, we impose the following rules on the joint distribution of \(G_{1,t}, \ldots , G_{K,t}\) to nest all the model structures above: For any k, either let \(G_{k,t} {\sim } {{\,\textrm{GIG}\,}}( \lambda _k, \alpha _k, \delta _k )\), or set \(G_{k,t}=1\). Moreover, for each group of size \(s < K\), \((G_{k_1,t}, \ldots , G_{k_s,t})\) is either independent or perfectly dependent, in the sense that \(G_{k_1,t}= \ldots =G_{k_s,t}\). That is, there exist d independent random variables \(\textbf{G}_{1,t}, \ldots , \textbf{G}_{d,t}\) such that, for each \(j=1,\ldots , d\), either \(\textbf{G}_{j,t} {\sim } {{\,\textrm{GIG}\,}}( \lambda _j, \alpha _j, \delta _j )\) for all t or \(\textbf{G}_{j,t}=1\) for all t. In what follows, let

$$\begin{aligned} \begin{aligned} \mathcal {K}_{0}&=\{k: G_{k,t} = 1 \}\\ \mathcal {K}_{1}&=\{k: G_{k,t} = \textbf{G}_{1,t} \sim {{\,\textrm{GIG}\,}}\Big (\lambda _1, \alpha _1, \delta _1 \Big )\}\\ \mathcal {K}_{2}&=\{k: G_{k,t} = \textbf{G}_{2,t} \sim {{\,\textrm{GIG}\,}}\Big ( \lambda _2, \alpha _2, \delta _2 \Big )\}\\&\vdots \\ \mathcal {K}_{d}&=\{k: G_{k,t} = \textbf{G}_{d,t} \sim {{\,\textrm{GIG}\,}}\Big (\lambda _d, \alpha _d, \delta _d \Big )\}, \end{aligned} \end{aligned}$$
(24)

where \(\mathcal {K}_{0}\) is allowed to be empty. Then, for a given choice of these sets, the model can be estimated by the ECME algorithm presented in Appendix 1. We will denote the resulting distribution by \(\textbf{Z} {\sim } \mathop{\textrm{HGH}}\left( \varvec{\mu }, \varvec{\Phi }, \textbf{L}, \mathcal {K} \right)\) with \(\mathcal {K}=\left( \mathcal {K}_j\right) _{j=0}^{d}\).

Note again that setting \(G_{k,t}=1\) for all \(k=1,\ldots , K\) and \(t=1,\ldots ,T\) gives the Gaussian model of Sect. 2 with its variations depending on the choice of \(\textbf{L}\). On the other hand, choosing \(G_{k,t} {\sim } {{\,\textrm{GIG}\,}}( \lambda _k, \alpha _k, \delta _k )\) independent for all k results in the HGH model. Alternatively, we might choose to only allow for \(G_{1,t} {\sim } {{\,\textrm{GIG}\,}}( \lambda _1, \alpha _1, \delta _1 )\), while for \(k>1\), \(G_{k,t}=1\) for all t. This would mean that there is a single shock \(G_{1,t}\) propagating through the whole distribution, but only ever working through \(\varepsilon _{1,t}\). Finally, imagine \(G_{1,t}= \ldots = G_{K,t}= \textbf{G}_{t} {\sim } {{\,\textrm{GIG}\,}}( \lambda , \alpha , \delta )\). Now again there is just a single shock, but this time it affects all independent errors \(\varepsilon _{1,t}, \ldots , \varepsilon _{K,t}\) simultaneously. In this case \((\varepsilon _{1,t}, \ldots , \varepsilon _{K,t}) {\sim } {{\,\textrm{MGHyp}\,}}(\lambda , \alpha , \delta , \textbf{0}, \textbf{I})\) and \(\varepsilon _{i,t} {\sim } N(\mu ,1)\), where \({{\,\textrm{MGHyp}\,}}(\lambda , \alpha , \delta , \varvec{\mu }, \varvec{\Sigma })\) denotes the (symmetric) multivariate generalized hyperbolic distribution; see, e.g., Paolella and Polak (2015b) and McNeil et al. (2015).

Lemma 2

Define for \(\textbf{v}=\left( v_1,\ldots , v_K \right) \in {\mathbb R}^K\),

$$\begin{aligned} \tilde{v}_i=c_{ii}^{1/2} \left( v_i+ \sum _{k=i+1}^K v_k q_{ki} \right) , \quad i=1,\ldots ,K, \end{aligned}$$

and

$$\begin{aligned} \omega _{j} = \left( \sum _{i \in \mathcal {K}_j} \tilde{v}_i^2\right) ^{1/2}. \end{aligned}$$
(25)

The moment generating function of \(\textbf{Z} {\sim } \mathop{\textrm{HGH}}\left( \varvec{\mu }, \varvec{\Phi }, \textbf{L}, \mathcal {K} \right)\) is given by

$$\begin{aligned}&{\mathbb M}_{\textbf{Z}}(\textbf{v})=\exp (\textbf{v}^\top \varvec{\mu }) \exp (\omega _0^2/2) \cdot \prod _{j=1}^d \left( \frac{\alpha _j^2}{\alpha _j^2 - \omega _j^2} \right) ^{\lambda _j/2} \frac{K_{\lambda _j}(\delta _j \sqrt{ \alpha _j^2 - \omega _j^2})}{K_{\lambda _j}(\delta _j \alpha _j)}, \end{aligned}$$
(26)

for \(\textbf{v}\) such that \(\omega _j \in (-\alpha _j, \alpha _j)\) for all j. This leads to the cumulant generating function

$$\begin{aligned} {\mathbb K}_{\textbf{Z}}(\textbf{v})&=\textbf{v}^\top \varvec{\mu } + \frac{\omega _0^2}{2} + \sum _{j=1}^d \left\{ \frac{\lambda _j}{2} \ln \left( \frac{\alpha _j^2}{\alpha _j^2 - \omega _j^2} \right) + \ln K_{\lambda _j}(\delta _j \sqrt{ \alpha _j^2 - \omega _j^2})-\ln K_{\lambda _j}(\delta _j \alpha _j) \right\} . \end{aligned}$$
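A small Python sketch of the cumulant generating function in Lemma 2 may help to see how the group structure enters. The function name `cgf_grouped`, the representation of \(\mathcal {K}\) as a list of index lists, and all numerical values are assumptions of this example. As a sanity check, placing every index in \(\mathcal {K}_0\) should reproduce the Gaussian cgf \(\textbf{v}^\top \varvec{\mu } + \textbf{v}^\top \textbf{L}\textbf{C}\textbf{L}^\top \textbf{v}/2\).

```python
import numpy as np
from scipy.special import kv

def cgf_grouped(v, mu, L, c_diag, groups, gig_params):
    """cgf of Z for the grouped model.

    groups[0] holds the Gaussian indices (K_0, possibly empty); groups[j] for
    j >= 1 holds K_j. gig_params[j - 1] = (lam_j, alpha_j, delta_j).
    """
    v_tilde = np.sqrt(c_diag) * (L.T @ v)     # vector form of the v-tilde terms
    omega_sq = [float(np.sum(v_tilde[np.asarray(idx, dtype=int)] ** 2))
                for idx in groups]
    value = float(v @ mu) + 0.5 * omega_sq[0]
    for j in range(1, len(groups)):
        lam, alpha, delta = gig_params[j - 1]
        rem = alpha ** 2 - omega_sq[j]
        if rem <= 0.0:
            raise ValueError("cgf undefined: need omega_j < alpha_j")
        value += 0.5 * lam * np.log(alpha ** 2 / rem)
        value += np.log(kv(lam, delta * np.sqrt(rem))) - np.log(kv(lam, delta * alpha))
    return value

# Hypothetical example with K = 4: index 0 is Gaussian, indices 1 and 2 share
# one GIG mixing variable, index 3 has its own.
K = 4
rng = np.random.default_rng(3)
L = np.tril(0.3 * rng.normal(size=(K, K)), k=-1) + np.eye(K)
c_diag = np.array([1.0, 0.5, 0.8, 0.6])
mu = np.array([0.01, 0.02, 0.0, -0.01])
v = np.array([0.2, -0.1, 0.3, 0.1])

groups = [[0], [1, 2], [3]]
gig_params = [(-0.5, 2.0, 1.0), (1.0, 3.0, 0.5)]
cgf_value = cgf_grouped(v, mu, L, c_diag, groups, gig_params)

# Gaussian limit: all indices in K_0.
cgf_gauss = cgf_grouped(v, mu, L, c_diag, [list(range(K))], [])
cgf_gauss_direct = v @ mu + 0.5 * v @ (L @ np.diag(c_diag) @ L.T) @ v
print(cgf_value, cgf_gauss, cgf_gauss_direct)
```

The cgf of a portfolio return \(S=\textbf{w}^\top \textbf{Y}\) can be obtained from the same function by evaluating it at \(\textbf{v}=t\,[\textbf{0}^\top , \textbf{w}^\top ]^\top\).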

The proof of Lemma 2 can be found in Appendix 2. In our context of factors, one can again derive the marginal mgf for this model class:

Lemma 3

Let \(\textbf{X}, \textbf{Y}\) be random vectors in \({\mathbb R}^{K_{f}}\) and \({\mathbb R}^{K_{r}}\) respectively, such that \(\textbf{Z}=[\textbf{X}, \textbf{Y}] \sim \mathop{\textrm{HGH}}(\varvec{\mu }, \varvec{\Phi }, \textbf{L}, \mathcal {K})\), with \(K=K_{f}+K_{r}\). Then the mgf of \(\textbf{Y}\) is given as

$$\begin{aligned}&{\mathbb M}_{\textbf{Y}}(\textbf{u})=\exp (\textbf{u}^\top \varvec{\mu }) \exp (\omega _0^2/2) \cdot \prod _{j=1}^d \left( \frac{\alpha _j^2}{\alpha _j^2 - \omega _j^2} \right) ^{\lambda _j/2} \frac{K_{\lambda _j}(\delta _j \sqrt{ \alpha _j^2 - \omega _j^2})}{K_{\lambda _j}(\delta _j \alpha _j)}, \end{aligned}$$

where for \(j =0, \ldots , d\), \(\omega _{j}\) is defined as in (25) with \(\tilde{v}_i\) replaced by \(\tilde{u}_i\), given for \(i=1,\ldots ,K\) by

$$\begin{aligned} \tilde{u}_i={\left\{ \begin{array}{ll}c_{ii,t}^{1/2} \sum _{\ell =k+1}^K u_{\ell } q_{\ell i}, &{} \text { if \;} i < k+1, \\ c_{ii,t}^{1/2} \left( u_i+ \sum _{\ell =i+1}^K u_{\ell } q_{\ell i} \right) , &{} \text { if \;} i\ge k+1. \end{array}\right. } \end{aligned}$$
(27)

From the above it follows that the linear combination \(S=\textbf{w}^\top \textbf{Y}\) has cgf

$$\begin{aligned} {\mathbb K}_{S}(t)&=\log ({\mathbb E}[\exp (t \textbf{w}^\top \textbf{Y})]) \nonumber \\&=t \textbf{w}^\top \varvec{\mu } + \frac{t^2\omega _0^2}{2} + \sum _{j=1}^d \left\{ \frac{\lambda _j}{2} \ln \left( \frac{\alpha _j^2}{\alpha _j^2 - t^2\omega _j^2} \right) + \ln K_{\lambda _j}(\delta _j \sqrt{ \alpha _j^2 - t^2\omega _j^2})-\ln K_{\lambda _j}(\delta _j \alpha _j) \right\} \end{aligned}$$
(28)

with \(\omega _{j} = \left( \sum _{i \in \mathcal {K}_j} \tilde{w}_i^2\right) ^{1/2}\), where \(\tilde{w}_i\), \(i=1,\ldots ,K\), is defined analogously to (27). This can be used for ES-based portfolio optimization as outlined in Sect. 3.
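To illustrate how a cgf of the form (28) can be evaluated and sanity-checked, the sketch below treats the gamma subclass used in Sect. 5, assuming \(G_j {\sim } \textrm{Gam}(\lambda _j,1)\) with shape \(\lambda _j\) and unit scale. Since \({\mathbb E}[\exp (sG_j)]=(1-s)^{-\lambda _j}\) for \(s<1\), the term in braces in (28) is then replaced by \(-\lambda _j \ln (1-t^2\omega _j^2/2)\), valid for \(t^2\omega _j^2<2\). The function name and all numerical inputs are hypothetical.

```python
import math

def cgf_gamma_mixture(t, mean, omega0, omegas, lambdas):
    """K_S(t) for gamma mixing G_j ~ Gam(lambda_j, 1); requires t^2 * omega_j^2 < 2."""
    value = t * mean + 0.5 * (t * omega0) ** 2
    for om, lam in zip(omegas, lambdas):
        s = 0.5 * (t * om) ** 2
        if s >= 1.0:
            raise ValueError("t lies outside the convergence strip of the cgf")
        value -= lam * math.log1p(-s)
    return value

# Hypothetical inputs: portfolio mean, Gaussian scale, and two mixing groups.
mean, omega0 = 0.1, 0.5
omegas, lambdas = [0.3, 0.2], [1.5, 3.0]

# Finite-difference check: K'(0) is the mean and K''(0) the variance of S.
h = 1e-4
k0 = cgf_gamma_mixture(0.0, mean, omega0, omegas, lambdas)
kp = cgf_gamma_mixture(h, mean, omega0, omegas, lambdas)
km = cgf_gamma_mixture(-h, mean, omega0, omegas, lambdas)
first = (kp - km) / (2 * h)
second = (kp - 2 * k0 + km) / h ** 2
variance = omega0 ** 2 + sum(l * o ** 2 for l, o in zip(lambdas, omegas))
print(first, second, variance)
```

The derivatives of the cgf at zero recover the mean and variance of \(S\); a saddlepoint approximation of the ES, for instance, would additionally use the cgf and its derivatives away from zero.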

As an example, consider the case when there is one G only and \(\textbf{C}=\textbf{I}\) for simplicity. In the case when this G is influencing all returns simultaneously, but is not present in the factors, it holds that

$$\begin{aligned} \textbf{Y} {\mathop {=}\limits ^{d}} \varvec{\mu }^{(\textbf{Y})} + \textbf{B} \varvec{\epsilon }^{(\textbf{X})} + \sqrt{G} \textbf{L}^{(\textbf{Y})} \varvec{\epsilon }^{(\textbf{Y})}. \end{aligned}$$
(29)

If we instead have a joint G in the factors and returns, then

$$\begin{aligned} \textbf{Y} {\mathop {=}\limits ^{d}} \varvec{\mu }^{(\textbf{Y})} + \sqrt{G} \left( \textbf{B} \varvec{\epsilon }^{(\textbf{X})} + \textbf{L}^{(\textbf{Y})} \varvec{\epsilon }^{(\textbf{Y})} \right) , \end{aligned}$$
(30)

so that

$$\begin{aligned} \textbf{Y} \sim {{\,\textrm{MGHyp}\,}}(\lambda , \alpha , \delta , \varvec{\mu }^{(\textbf{Y})}, \textbf{B} \textbf{B}^{\top } + \textbf{L}^{(\textbf{Y})} \textbf{L}^{(\textbf{Y}) \top } ), \end{aligned}$$

which may formally be checked using the mgf from Lemma 2.
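This check can be sketched as follows, assuming (in addition to \(\textbf{C}=\textbf{I}\)) that \(\varvec{\epsilon }^{(\textbf{X})}\) and \(\varvec{\epsilon }^{(\textbf{Y})}\) are independent standard normal vectors that are also independent of \(G {\sim } {{\,\textrm{GIG}\,}}(\lambda , \alpha , \delta )\). Writing \(\varvec{\Sigma } = \textbf{B} \textbf{B}^{\top } + \textbf{L}^{(\textbf{Y})} \textbf{L}^{(\textbf{Y}) \top }\), (30) implies \(\textbf{Y} \mid G \sim N(\varvec{\mu }^{(\textbf{Y})}, G \varvec{\Sigma })\), so that, with \(\omega ^2 = \textbf{u}^\top \varvec{\Sigma } \textbf{u} < \alpha ^2\),

$$\begin{aligned} {\mathbb M}_{\textbf{Y}}(\textbf{u})&= \exp (\textbf{u}^\top \varvec{\mu }^{(\textbf{Y})}) \, {\mathbb E}\left[ \exp \left( G \omega ^2 / 2 \right) \right] \\&= \exp (\textbf{u}^\top \varvec{\mu }^{(\textbf{Y})}) \left( \frac{\alpha ^2}{\alpha ^2 - \omega ^2} \right) ^{\lambda /2} \frac{K_{\lambda }(\delta \sqrt{ \alpha ^2 - \omega ^2})}{K_{\lambda }(\delta \alpha )}, \end{aligned}$$

where the second equality uses the same GIG expectation that produces each factor in (26). This is (26) with \(d=1\), \(\omega _0=0\) and \(\omega _1=\omega \): the mgf depends on \(\textbf{u}\) only through the quadratic form \(\textbf{u}^\top \varvec{\Sigma } \textbf{u}\), as expected for a symmetric MGHyp distribution with dispersion matrix \(\varvec{\Sigma }\).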

We note that we assume \(\mathcal {K}\) to be known beforehand. This allows for an introduction of prior knowledge into the model as groups of assets can be formed, sharing a common latent variable. This not only leads to an additional dependence structure for each group, but can also be used to enforce a common tail behavior for a group if \(\textbf{L}\) is appropriately constrained. This might make sense, for instance, if different asset classes are involved.

5 Empirical results

In a comprehensive empirical analysis, we compare our proposed modeling approach with classical methods from the literature. We consider two datasets: the 49 daily industry portfolios (FF49) from the Kenneth R. French homepage, and the 42 hourly cryptocurrencies from the Bitfinex homepage for which at least half a year of observations is available; see Table 3 in Appendix 3. We excluded the cryptocurrencies that exhibit almost no variation over time, like DAI. In the case of the FF49 dataset, we consider the Fama–French three factors (FF-3), the Fama–French five factors (FF-5), and each of these with an additional momentum factor (FF-3 + MOM and FF-5 + MOM). The common risk factors thus include the excess return on the market factor, which includes all NYSE, AMEX, and NASDAQ firms (Mkt-RF), the high minus low factor (HML), the conservative minus aggressive factor (CMA), the robust minus weak factor (RMW), the small minus big factor (SMB) and the momentum factor (MOM). Again, the data are downloaded from the Kenneth R. French homepage.

In the case of the cryptocurrencies, we only consider the market factor (CMkt) as a predictor. Similar to Shen et al. (2020), the market factor is constructed from 65% Bitcoin (BTC), 25% Ethereum (ETH) and giving equal weight to the remaining 10% in the universe. Table 1 shows the first four moments of the seven factors, with added bootstrap standard errors in brackets. It appears that the factors do not exhibit strong asymmetry as measured by the sample skewness, the use of which assumes existence of third moments. The sample kurtosis values serve as a heuristic to indicate the presence of strong leptokurtosis or the possible presence of heavy tails, though there appears to be no large discrepancy between the tail behaviors in terms of kurtosis between all factors of the FF49. We also would like to emphasize that pooling the data as we do here tends to drive up kurtosis values, whereas they appear closer to Gaussian for many of the much shorter moving windows for the FF49 data. On the other hand, the cryptocurrencies appear to exhibit heavy tails also in rolling windows. However, these remain heuristic arguments, as assessing the maximally existing moment of the underlying distribution is nearly futile (see, e.g., McNeil et al., 2015; Paolella, 2016, and the numerous references therein). Interestingly, the market factor of the FF49 (Mkt-RF) and the market factor of the cryptocurrencies (CMkt) exhibit very similar behavior in these summary statistics. Overall, the summary statistics provide a heuristically based argument that modeling the factors themselves with a heavy-tailed distribution might be beneficial. Formal testing procedures will not help: They have very low power, and the mapping from a Neyman–Pearson test, or use of Fisherian p-values, to efficacy in forecasting is anyway not established. As such, it is not clear whether the ability to model vastly different tail behaviors of the HGH is needed for the six factors in the FF49 application. It might make more sense to model a single tail behavior for all factors; see Sect. 4. Finally, the risk-free rate is subtracted from the returns of the FF49 data.
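The weighting scheme of the CMkt factor can be written out explicitly. The following is a minimal sketch under one reading of the description above, namely fixed weights of 65% on BTC and 25% on ETH, with the remaining 10% split equally over all other coins in the universe. The function names, the tickers other than BTC and ETH, and the return values are hypothetical and not taken from the dataset.

```python
def market_weights(tickers, w_btc=0.65, w_eth=0.25):
    """Weights of the market factor: fixed BTC/ETH shares, equal split of the rest."""
    others = [t for t in tickers if t not in ("BTC", "ETH")]
    if not others:
        raise ValueError("the universe must contain coins other than BTC and ETH")
    rest = 1.0 - w_btc - w_eth
    weights = {"BTC": w_btc, "ETH": w_eth}
    for t in others:
        weights[t] = rest / len(others)
    return weights

def market_return(returns, weights):
    """Weighted average of one period's returns (dict: ticker -> return)."""
    return sum(weights[t] * returns[t] for t in weights)

# Hypothetical universe of 42 coins with placeholder tickers C0, ..., C39.
tickers = ["BTC", "ETH"] + [f"C{i}" for i in range(40)]
weights = market_weights(tickers)
one_period = {t: 0.0 for t in tickers}
one_period["BTC"] = 1.0   # a 1% move in BTC only
print(market_return(one_period, weights))
```

With 42 coins, each of the 40 remaining coins receives a weight of 0.25%, which makes explicit how strongly the factor is driven by BTC.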

Absent any immediate way to group assets, we focus on the two models described in Sect. 2 and conduct mean expected shortfall portfolio optimization, as described in Sect. 3, setting \(\theta = 0\) and \(\alpha = 0.15\). In case of the HGH model structure, we focus on a subclass of our distribution by taking each \(G_k\) to be standard gamma distributed, i.e., \(G_k {\sim } {{\,\mathrm{\mathop{\textrm{Gam}}}\,}}(\lambda _k,1)\) independent for all \(k\in \{1,\ldots ,K\}\). The gamma distribution arises as a special case of the \({{\,\textrm{GIG}\,}}(\lambda , \alpha , \delta )\) in the limit as \(\delta \rightarrow 0\).
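The effect of this limit on the mgf (26) can be sketched as follows, assuming \(\lambda _j>0\) and using the standard small-argument behavior \(K_{\lambda }(x) \sim \Gamma (\lambda ) 2^{\lambda -1} x^{-\lambda }\) as \(x \downarrow 0\). The ratio of Bessel functions in the j-th factor of (26) satisfies

$$\begin{aligned} \lim _{\delta _j \rightarrow 0} \frac{K_{\lambda _j}(\delta _j \sqrt{ \alpha _j^2 - \omega _j^2})}{K_{\lambda _j}(\delta _j \alpha _j)} = \left( \frac{\alpha _j^2}{\alpha _j^2 - \omega _j^2} \right) ^{\lambda _j/2}, \end{aligned}$$

so that the j-th factor becomes

$$\begin{aligned} \left( \frac{\alpha _j^2}{\alpha _j^2 - \omega _j^2} \right) ^{\lambda _j} = \left( 1 - \frac{\omega _j^2/2}{\alpha _j^2/2} \right) ^{-\lambda _j}, \end{aligned}$$

which is the mgf of a gamma random variable with shape \(\lambda _j\) and rate \(\alpha _j^2/2\), evaluated at \(\omega _j^2/2\). Under this reading, a unit-rate gamma mixing variable corresponds to \(\alpha _j^2=2\), and no Bessel function evaluations are needed in the mgf.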

The results are summarized in Table 2 and Fig. 1. As performance measures, we report the average annualized return (Return), the annualized standard deviation (Volatility), the final return (Total Return), the maximum drawdown (Max. Drawdown), the annualized percentage turnover (Turnover), the annualized Sharpe ratio (Sharpe), the annualized Sortino ratio (Sortino), the annualized percentage STARR ratio at the \(98.5\%\) level (STARR\(_{98.5 \%}\) %) and the annualized empirical expected shortfall at the \(98.5\%\) level (ES\(_{98.5 \%}\)). Note that the STARR ratio is defined as the average return divided by the empirical expected shortfall at a given level. The Factor-HGH model is estimated with \(L^{(r)}\) equal to the identity matrix, while the Gaussian–Cholesky model has an unconstrained \(L^{(r)}\). We propose this configuration as a good trade-off between performance and estimation complexity. Further, in the case of the FF49 dataset, the five Fama–French factors are used as predictors. For an extensive performance overview of all combinations of model structures and factors, see Appendix 4.
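As an illustration of the two tail-based measures, the sketch below computes an empirical expected shortfall and the resulting STARR ratio from a series of portfolio returns. It assumes one common convention, namely that the empirical ES at a given level is the average loss over the worst \((1-\text{level})\) share of observations, reported as a positive number; annualization is omitted, and the function names and data are hypothetical.

```python
def empirical_es(returns, level=0.985):
    """Average loss over the worst (1 - level) share of the observations."""
    n = len(returns)
    k = max(1, round(n * (1.0 - level)))   # number of tail observations
    worst = sorted(returns)[:k]
    return -sum(worst) / k

def starr(returns, level=0.985):
    """Average return divided by the empirical expected shortfall."""
    return (sum(returns) / len(returns)) / empirical_es(returns, level)

# Hypothetical return series: five losses and 95 small gains (in percent).
returns = [-5.0, -4.0, -3.0, -2.0, -1.0] + [1.0] * 95
print(empirical_es(returns, level=0.95))
print(starr(returns, level=0.95))
```

Unlike the Sharpe ratio, which penalizes upside and downside deviations alike, the STARR ratio only penalizes the size of the worst outcomes, which is the more relevant risk measure when the portfolio is optimized with respect to the ES.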

Table 1 Various sample statistics of the 5787 and 3601 percentage returns, respectively, for the \(K_f=6\) factors of the FF49 and the \(K_f=1\) factor of the cryptocurrencies under study
Table 2 Comparison of the gross performance of mean-expected shortfall portfolios for the FF49 dataset and the crypto dataset
Fig. 1 Gross percentage long-only cumulative portfolio returns for the 49 industry portfolios and the 42 cryptocurrencies. The industry portfolios contain daily observations from 03-01-2000 until 31-12-2022. The cryptocurrencies contain hourly observations from 01-01-2021 until 31-05-2021. The Factor-HGH model is estimated with \(L^{(r)}\) equal to the identity matrix

An issue that arises for the practical use of the HGH model (and thus also of the Factor-HGH model) is the choice of the ordering of the marginals. For this rolling window exercise, the reordering of the components in \(\textbf{Z}_t\) is done beforehand using 1,000 data points. On those datasets, the same approach as in Näf et al. (2019) is used, but separately for factors and returns. That is, while we change the ordering within factors and returns by a permutation matrix \(\textbf{P}\), we still have that for \(t=1, \ldots , T\),

$$\begin{aligned} \textbf{P}\textbf{Z}_t=\begin{bmatrix} \textbf{P}_1 \textbf{f}_t\\ \textbf{P}_2 \textbf{r}_t \end{bmatrix} \end{aligned}$$

for two permutation matrices \(\textbf{P}_1\), \(\textbf{P}_2\).
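The block structure of \(\textbf{P}\) simply means that factors are only reordered among themselves, and likewise for returns. A minimal sketch of applying such a permutation is given below; the orderings themselves would come from the procedure of Näf et al. (2019), which is not reproduced here, so the index lists and the function name are hypothetical placeholders.

```python
def permute_blocks(z, k_f, order_f, order_r):
    """Reorder z = [f_1, ..., f_{K_f}, r_1, ..., r_{K_r}] within each block only."""
    f, r = z[:k_f], z[k_f:]
    if sorted(order_f) != list(range(len(f))):
        raise ValueError("order_f must be a permutation of the factor indices")
    if sorted(order_r) != list(range(len(r))):
        raise ValueError("order_r must be a permutation of the return indices")
    return [f[i] for i in order_f] + [r[i] for i in order_r]

# Hypothetical example with two factors and three returns.
z = ["f1", "f2", "r1", "r2", "r3"]
print(permute_blocks(z, k_f=2, order_f=[1, 0], order_r=[2, 0, 1]))
```

Because the first \(K_f\) positions remain factors after the reordering, the partition of \(\textbf{Z}_t\) into factors and returns used in Lemma 3 is preserved.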

In the case of the FF49 dataset, using the FF-5 factors, we barely see a performance difference between the modeling approaches. Even more surprisingly, the intercept-only model (using no factor information at all) performs similarly to the more sophisticated model structures. Although better than the base HGH model, which uses no factor information, our Factor-HGH model falls behind all the classical Gaussian models. It seems that heavy-tail-adjusted residuals bring no benefit in this rather Gaussian universe. However, with a Sharpe ratio difference of only 0.08 relative to the Fama–French model, the damage appears limited. Further, the classical Fama–French model is slightly better than the Gaussian–Cholesky model structure in terms of the Sharpe ratio. Finally, as can be seen in Appendix 4, the model ranking is consistent across the different factor structures FF-3, FF-3 + MOM, FF-5 and FF-5 + MOM.

If we switch to cryptocurrencies, which clearly exhibit stylized facts such as heavy tails and varying tail behavior even at the rolling window level, our proposed model structures start to excel. Compared to the benchmark models, the portfolio based on the Factor-HGH model almost doubles the average return while keeping the volatility, the maximum drawdown, the turnover, and the expected shortfall at a low level. Notably, as of the beginning of January 2021, we observe that the two HGH models are more robust against downside risk. Also, their cumulative return climbs well above that of the Gaussian models afterward. However, it must be emphasized that from mid-January until the beginning of February, the volatility of the HGH models is considerably higher than that of the Gaussian models. Further, in May, the Factor-HGH and the basic (no factor) HGH remain on a stable path, while the Gaussian models are more volatile and their cumulative returns begin a gentle decline. The stable performance of the HGH-type models is due to the investment in the more stable coins, such as PAX. Compared to the base HGH, the Factor-HGH achieves this stable performance with a much lower turnover. This low turnover is remarkable also in comparison with the other models. It implies that, in terms of net returns, the difference between the Factor-HGH and the competing models is even more pronounced. As shown in Table 6 and Fig. 2 in Appendix 4, this strong performance in net gains remains when looking at a longer period from 01-01-2020 until 31-05-2021. However, when looking at the more recent period from 01-07-2022 until 31-12-2022, the classical models perform better than the HGH model structures in terms of out-of-sample average return and volatility; see Fig. 3 and Table 7 in Appendix 4. Still, the turnovers of the Gaussian-based models are almost double the turnover of the Factor-HGH.

If one simply invested in the market factor, which is mainly driven by BTC, one would end up with much worse performance. Unlike in our crypto universe, for the 49 industry portfolios, investing in the market factor only does not lead to a catastrophic performance. However, the summary statistics of Mkt-RF and CMkt in Table 1 would not necessarily predict such a different outcome. This further highlights the difficulty of investing in a crypto universe.

In summary, when considering the 49 industry portfolios from the Kenneth R. French homepage, the Factor-HGH can keep up with classical Gaussian models. On the other hand, when we turn our attention to a more non-Gaussian setting, such as cryptocurrencies, we observe a sound performance benefit of the Factor-HGH model compared to the benchmark models.

6 Conclusion

This paper details a method for incorporating exogenous common factor information into a tractable model that allows for non-elliptic behavior in the factors and asset returns, including semi-heavy tails, margin asymmetry, and, notably, heterogeneous tail behavior. Parameter estimation is straightforward and fast via an ECME algorithm, as is conducting mean-ES portfolio optimization. The proposed model structure is shown to generalize several existing factor models, including the classical Gaussian factor models of Fama and French (1993), Carhart (1997), and Fama and French (2015).

The empirical analysis indicates that classical Gaussian models and the new Factor-HGH models perform similarly for the standard FF49 data set. Irrespective of the model used, incorporation of the common factor information does not lead to markedly improved performance in the FF49 universe: The intercept-only model is almost as good as the more sophisticated factor models. However, in the case of cryptocurrencies, where stylized facts such as heavy tails and heterogeneous tail behavior are rather pronounced, the proposed HGH structures clearly outperform the classical models. Intriguingly, the dimension reduction achieved by modeling dependence only through the factors leads to a performance boost compared to the full HGH distribution. Further, in the case of the cryptocurrency data, incorporation of the common market factor information into the model structure results in a clear performance increase. To conclude, in the world of cryptocurrencies, the proposed Factor-HGH opens an interesting playground for modeling factors and returns jointly under non-Gaussian errors.